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Abstract 

We describe a parallel hybrid symplectic integrator for planetary system integration that runs on a graph- 
ics processing unit (GPU). The integrator identifies close approaches between particles and switches from 
symplectic to Hermite algorithms for particles that require higher resolution integrations. The integrator is 
approximately as accurate as other hybrid symplectic integrators but is GPU accelerated. 
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1. Introduction 

The field of solar system dyn amics has a timeline of discoveries that is related to the computational power 
available (e.g.. iMorbidellil 12001 ). As computational power has increased over time, so to has our ability to 
more accurately simulate incrementally more complex systems via the inclusion of more sophisticated physics 
or simply a greater number of particles. The interaction between planets and planetesimals subsequent to 
formation is a well posed N-body problem but displays remarkable complexity even in the absence of colli- 
sions, including resonance capture, planetary migration, and heating and scattering of planetesimals. In this 
paper we attempt to more accurately simulate this system in parallel by using the increased computational 
power and device memory recently made more accessible on video graphics cards. 

Our codqj is written for Compute Unified Device Architecture (CUDA) enabled devices. CUDA, im- 
plemented on NVIDIA graphics devices, is a GPGPU (General-purpose computing on graphics processing 
units) architecture that allows a programmer to use C-like programming language with extensions to code 
algorithms for execution on the graphics processing unit. It provides a development environment and Ap- 
plication Programming Interfaces (APIs) for CUDA enabled GPUs specifically tailored for parallel compute 
purposes. It achieves this by exposing the hardware to the developer through a memory management model 
and thread hierarchy that encourages both constant streaming of data as well as parallelization. We will 
discuss porting the code to other parallel computing environments below. 

2. A second order democratic heliocentric method symplectic integrator for the GPU 

Symplectic integrators are useful for planetary system integrations because they preserve an energy (or 
Hamiltonian) that is close to the real value, setting a bound on t he energy error during long integrations 



rlamntom an) that is close to the real value, setting a pound on t he energy erro r durin g long int egrations 
(IWisdom fc Holmanl . [l99ll:l Wisdom. Holman fc Toumalll996l ). See lYoshidal(|l990Lll993h : lLeimkuhler fc Rich! 
( 20041 ) for re yiews of symp l ectic i n tegrators. We hav e modified the second order symplectic integrators in- 
troduced by iDuncan et alj (J1998I ); IChambersI (|1999l ). and created an integrato r that runs in parallel on a 
GPU. We have chosen the democratic heliocentric method (JDuncan et al.l . ll998l ) because the force from the 



Email address: amoore60pas.rochester.edu & aquillenSpas.rochester.edu (Alexander Moore & Alice C. Quillen) 
URL: http://astro.pas.rochester.edu/-aquillen/ (Alexander Moore &; Alice C. Quillen) 
1 QYMSYM is available for download at http://astro.pas.rochester.edu/- aquillen/qymsym 



Preprint submitted to New Astronomy 



July 21, 2010 



central body is separated from the integration of all the remaining particles and the coordinates do not 
depend on the order of the particles. 



F ollowing a canonical transformation, in heliocentric coordinates and barycentric momenta (jWisdom. Holman fc Touma , 
199a ) the Hamiltonian of the system can be written 



H(P, Q) = H Dft (P) + H Kep (P, Q) + H Int (Q) 



where 
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is a linear drift term and Pj is the barycentric momenta of particle i. Here mo is the central particle mass. 
The second term Hxep is the sum of Keplerian Hamiltonians for all particles with respect to the central 
body, 
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where Q^ are the heliocentric coordinates and are conjugate to the barycentric momenta. Here m s is the 
mass of the j-th particle and G is the gravitational constant. The interaction term contains all gravitational 
interaction terms except those to the central body, 
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A second order integrator advances with timestep r using evolution operators (e.g.. lYoshidalll990l ) 
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The Keplerian advance requires order N computations but interaction term requires order TV 2 computa- 
tions. However, encounter detection is most naturally done during the Keplerian step and requires 0{N 2 ) 
computations if all particle pairs are searched for close encoun ters. If the re is no search for close encoun- 



20081) reducing the number of 



ters then the Keplerian and Interaction steps can be switched (jMoore et al 
computations. 

Each of the evolution operators above can be evaluated in parallel. The drift evolution operator requires 
computation of the sum of the momenta. We have implemented this using a parallel reduction sum parallel 
primitive algorithm available with the NVI DIA CUDA Softwar e Development Kit (SDK) 1.1 that is similar 



to the parallel prefix sum (scan) algorithm ([Harris et all 120081 ) 



The Keplerian step is implemented by c omputing / and g functions using the universal differential 
Kepler's equation (jPrussing fe ConwavL Il993h so that bound and unbound particles can both be integrated 
with the same routine (see appendix). The Keplerian evolution step is also done on the GPU with each 
thread computing the evolution for a separate particle. Kepler's equation is usu ally solved iteratively 
until a precision limit is a chieved. However, the Laguerre algorithm ( Conwavul986l) (also see chapter 2 by 
Prussing fc Conwavlll993l) converges more rapidly than a Newton method and moreover converges regardless 
of the starting approximation. We have found that the routine converges to the double precision limit (of 
order 10~ 16 ) in fewer than 6 iterations independent of initial condition. See the appendix for the procedure. 

The interaction terms are computed on the GPU with all iV 2 force pairs evaluate d explicitly in parallel. 
The algorithm is based on the algorithm described by ( Nvland. Harris, fc Prinsl 12008). This algorithm takes 
advantage of fast shared memory on board the GPU to simultaneously compute all forces in a p x p tile of 
particle positions, where p is the number of threads chosen for the computation (typically 128 or 256). The 
total energy is evaluated with a kernel explicitly evaluating all iV 2 pair potential energy terms, similar to 
that calculating all iV 2 forces. 



After the change to heliocentric/barycentric coordinates, the position of the first coordinate corresponds 
to the center of mass and center of momentum. The trajectory of this particle does not need to be integrated. 
However, it is convenient to calculate the energy using all pair interactions including the central mass. The 
interaction term in the Keplerian part of the Hamiltonian can be computed at the same times as Hint if Qo 
is set to zero. Consequently we set Qo = Po = at the beginning of the computation. This is equivalent to 
working in the center of mass and momentum reference frame. Because we would like to be able to quickly 
check the total energy, we have chosen to keep the first particle corresponding to the center of mass and 
momentum as the first element in the position and velocity arrays. During computation of Hi nt we set too 
to zero so that force terms from the first particle are not computed. These are already taken into account 
in the evolution term corresponding to Hxep- The mass is restored during the energy sum computation as 
all potential energy terms must be calculated explicitly. 

The particle positions and velocities are kept on the GPU during the computation and transferred back 
into host or CPU accessible memory to output data files. An additional vector of length equal to the 
number of particles is allocated in global memory on the device to compute the momentum sums used in 
the drift step computation. We limit the number of host to device and device to host memory transfers by 
persistently maintaining position and velocity information for all particles on the device because frequent 
data transfer between the CPU and GPU will reduce overall performance of the code. The maximum 
theoretical throughput of PCI-express 2.0 16x bus technology, the interlink between the CPU and GPU on 
Intel processor based motherboards, is 8 GB/s with a significant latency penalty. This sets an upper limit 
of the total number of particles to ~ 10 7 based on several gigabytes of GPU memorjQ but computation 
time on a single gpu for this number of particles would make such simulations unrealistic. 

While the transfer data rate limitations between the host and device are important, use of global device 
memory should also be monitored carefully. Depending on memory clock speed, width of the memory 
interface and memory type, theoretical maximum global memory transfer throughputs are currently ~150 
GB/s. This is the case for the GT200(b) architecture Quadro FX 5800 and GeForce GTX280/285 GPUs 
in our cluster, which range from 140 to 160 GB/s respectively. Despite enjoying a significantly greater data 
transfer rate compared to that of the PCI-express interlink, the latency penalty of device memory is also 
quite large - from 400 to 800 cycles. Therefore, explicit global memory access should be limited whenever 
possible. Shared memory, which is basically a user controlled cache that can be accessed by all cores on 
an individual multiprocessor, can be used to reduce this penalty. As described above, the interaction and 
energy kernels leverage shared memory and benefit greatly. By streaming information from global memory 
to shared memory, we are able to hide the latency of global memory, further increasing the computation 
speed. Even when all global memory transactions of a warp issued to a multiprocessor can be coalesced 
(executed simultaneously), the latency is at least the aforementioned several hundred cycles. Uncoalesced 
global memory access can be even more costly. However, it is not always possible to write an algorithm to 
access shared memory in a sensible manner, and limitations on the amount of shared memory space may 
force direct global memory access regardless. This is the case for our GPU implementation of the solver for 
the universal differential Kepler's equation. 

Though the maximum number of threads on the video cards we used was 512 for GT200 architecture 
cards and 1024 for GF100 architecture cards, we found that restrictions on the number of available registers 
on each multiprocessor limited all of the major kernels to 128 or 256 threads per block. A more detailed 
review of NVIDIA GPU hardware and programming techniques can be found in the CUDA Programming 
Guide. 



Our first parallel symplectic integrator necessarily ran in single precision (jMoore et all 120081 ) as graphics 
cards were not until recently capable of carrying out computations in double precision. Our current im- 
plementation uses double precision for all vectors allocated on both GPU and CPU. A corrector has been 
implemented allowing accele rations to be compute d in single precision but using double precision accuracy 
for the particle separations (JGaburov et all 120091 ) . If we wrote a similar corrector we could run our inter- 



action step in single precision (but with nearly double precision accuracy) achieving a potential speed up 



2 The linked lists used in the collision detection routines to be discussed enforce a lower realistic limit. 

3 



of a factor of roughly 8 on CUDA 1.3 compatible devices. The newest CUDA 2.0 compatible devices have 
superior double precision capabilities, limiting this potential speed up to a factor of 2. Additionally, it would 
be more difficult to create a corrector for the Keplerian evolution step, consequently the current version of 
the integrator is exclusively in double precision. 

We work in a lengthscale in units of the outermost planet's initial semi-major axis and with a timescale 
such that GM* = 1 where M* is the mass of the central star. In these units the innermost planet's orbital 
period is 2ir, however we often describe time in units of the innermost planet's initial orbital period. 



2.1. Close encounters 

Symplectic integrators cannot reduce the timestep during close encounters wit hout shifting th e Hamil- 
tonian integrated and destroying the symplectic properties of the integrator (e.g.. lYoshidalll990l ). During 
a close encounter one of the interaction terms in Hj nt becomes large compared to the Keplerian term, 
HRep- Consequently, the symplectic integrator described above becomes inaccurate when two mass i ve ob - 
jects undergo a close approach. To preserve the symplectic nature of the integrator iDuncan et al.1 ([19981 ) 
used an operator s plitting approach and decomposed the potential into a set of functions with increasingly 
small cutoff radii. Chambers! (19991) instead used a transition function which because of its relative sim- 



plicity (and reduced numbers of computations) we have adopted here. A transition function can be used to 
move the strong interactio n terms from the interaction Hamiltonian to the Keplerian one so that the entire 
Hamiltonian is preserved ( Chambers! . 1 19991 ) . The Keplerian Hamiltonian becomes 
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and the interaction Hamiltonian becomes 
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where qij = |Qj — Qj| and K(qij) is a transition or change-over function that is zero when the distance 
between two objects is small and 1 when they are large, thus H' K + H' Int = Hjcep + Hj nt . 
The transition function we use is 
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The parameter y becomes zero at q^ ~ 0.lr cr i t where r cr u is a critical radius. 

The above transition function is similar to that chosen by IChambers! ( 19991) but we use a sine function 
instead of a polynomial function as sine is computed quickly on graphics card (esti mated at 4 or 5 floating 
point computations rather than the 8 required for the polynomial function used bv lChamberslll999l ). 



2.2. Hermite integrator 

The numerical integrator used for particles undergoing close approaches also must be reasonably fast as 
we would like to make it possible for our integrator to integrate as many particles as possible. Consequently 
we have chosen a 4-th order adaptive step size Hermite integrat or for particles undergoing close approaches 
instead of the Burlisch-Stoer integrat or use d bv lChambersI ( 1999J ). This integrator forms t he heart of many N- 
body integrators fe-g.- lMakino et al. 2003). Our code follows the algorithm described bv lMakino fc Aarsethl 



(|l992h but does not use the Ahmad-Cohen scheme. While the Hermite integrator runs on the CPU, we have 

4 



implemented the routine to integrate the accelerations and jerks both on the CPU and GPU. The GPU 
version is called if the number of particles integrated exceeds a certain value, np aW itch- 

The Hermite integrator must be modified to use the transition function for computation of both accelera- 
tions and jerks (time derivatives of the accelerations). Using predicted velocities we evaluate the acceleration 
Qi of particle i according to the following 






ILL 

J?. 



,-**(*«)] 



4o 



(9) 



where qy = q, — q^. Here Si 



e 2 and e is an optional smoothing length (see equation 2 by 



Makino fc Aarsethlll992l for comparison). The jerks are evaluated with 
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where Vjj = q^ — q, are the difference between predict position and velocities vectors, respectively, for 
particles i and j; 

The transition radius r cr u was described by IChambersI ( 19991) in terms of the mutual Hill radius or 



Rmh 



mi + rrij 

3Afn 



1/3 



(11) 



though IChambersI ( 19991) also included a expression that depends on the speed. One problem with this 
choice is that the force between particles don't solely depend on the distance between the particles as they 
also depend on the distance to the central star. This implies that the forces are not conservative and makes 
it more difficult to check the energy conservation of the Hermite integrator. Instead we choose a critical 
radius, r cr i t , prior to the Hermite integration and keep it fixed during the integration. The critical radius 
r cr it is defined to be a factor an times the maximum Hill radius of the particles involved in an encounter at 
the beginning of the Hermite integration. 



Tcrit 



= an x max {ru,i for i in encounter list} 



(12) 



where m,i is the Hill radius of particle i. 

When using the Hermite integrator we do not allow the central star to move as we keep the system in 
heliocentric coordinates. We have checked that the total energy given by H' K (equation[6]) is well conserved 
by the Hermite integrator. We find that the accuracy is as good as that using the Hermite integrator lacking 
the transition function and is set by the two parame ters controlling the timestep choice r\ and r\ s (see section 
2.1 and equations 7, and 9 lMakino fc Aarsethlll992h . 

We note that our modified interaction step requires iV 2 computations and adding in the transition 
function would substantially add to the number of computations involved in computing accelerations. We 
instead make use of the list of particles identified during our encounter identification routine to correct the 
forces on the particles that are involved in encounters. Consequently all interactions are computed and then 
only those involved in encounters are corrected just before the Hermite integrator is called. 



2.3. Integration procedure for the Hybrid integrator 
Our procedure for each time step is as follows: 

1. Do a drift step (evolving using H^ft', equation [5]) for all particles for timestep r/2 on GPU. 

2. Do an interaction step (evolving using Hi nt ; equation 0|) using all particles and for all interactions and 
without using the transition function K for timestep r/2 on GPU. 

3. Do a Keplerian step (evolving using / and g functions) for all particles for timestep r on the GPU. 
Note that the ones undergoing close approaches have been inaccurately integrated but will be corrected 
later. 

4. Use stored positions and velocities (qo, vo, qi, vi) prior and after the Keplerian step in global memory 
on the GPU to identify close encounters on the GPU. If there are encounters, transfer pair lists onto 
the CPU and divide the list of particles involved in encounters into non-intersecting sets. Calculate 
the maximum Hill radius for particles in each encounter list and use this radius to compute a critical 
radius r C rit f° r each encounter set. Only if there are encounters are the positions and velocities prior 
and after the Kepstep copied onto the CPU. 

5. For each encounter set, subtract interactions that should not have been previously calculated in step 
#2 using stored positions and velocities on the CPU. Previously we calculated the interaction step 
using all interactions. However we should have weighted them with a transition function K (equation 
12.11) . Now that we have a critical radius, r cr u, estimated for each encounter list we can correct the 
interaction step. After interactions weighted by 1 — K have been subtracted, the interaction step 
effectively calculated is H' Int (equation EJ rather than H] nt (equation H]). 

6. Use a modified Hermite integrator to integrate close approach sublists for r using the stored CPU 
positions and velocities qo,vo. 

7. For each encounter set, subtract interactions that will be incorrectly calculated by a repeat of the 
interaction step in step #9. 

8. If there have been encounters, copy only particles involved in encounters back into the arrays qi 
and vi on the CPU. Copy the entire arrays on the the GPU. Particles involved in encounters have 
been integrated using a Hermite integrator. Particles not involved in encounters have had Keplerian 
evolution only. 

9. Do an interaction step (using Hi nt ) using all particles and for all interactions and without using the 
transition function K for timestep r/2. 

10. Do a drift step (using Ho ft) for all particles for timestep r/2. 

As we have checked for encounters during every time step we can flag encounters involving more than 
one planet. In this case we can choose to do the entire integration step for all particles with the Hermite 
integrator (but with accelerations and jerks computed on the GPU). As planet /pla net encounters ar e rare 
this does not need to be done often. Hybrid symplectic integrators such as Mercury (jChambersl . Il999f ) have 



largest deviations in energy during infrequent planet/planet encounters. By integrating the entire system 
with a conventional N-body integrator during planet/planet encounters we can improve the accuracy of the 
integrator without compromising the long term stability of the symplectic integrator. 

2-4- Identifying close encounters 

Close encounter identification is in general an order ./V 2 computation as all particle pairs must be checked 
every timestep. This is potentially even more computationally intensive than computing all interactions. 
We do this with two sweeps, each one considering fewer particle pairs. The first sweep is crude, covers all 
possible particle pairs and so is order TV 2 . This one should be as fast as possible to minimize its computational 
intensity. Shared memory is used for particle positions in a tile computation similar to that used to compute 
all force interactions. The computation can be done with single precision floating point computations and 
we can be conservative rather than accurate with encounter identification. For each particle we compute its 
escape velocity (and this is order N as it is only done for each particle). A particle pair is counted if the 
distance between the particles is smaller than the sum of a factor times the mutual Hill radius times the 
distance moved by the first particle moving at its escape velocity during the timestep. 



The first kernel call sweeps through all particle pairs but only counts the number of possible interactors. 
An array of counts (one per particle) is then scanned in parallel using the parallel prefix (scan) function 



([Harris et all l2008h available in the CUDPP subroutine library. CUDPP is the CUDA Data Parallel Prim- 



itives Library and is a library of data-parallel algorithm primitives such as parallel prefix-sum, parallel sort 
and parallel reduction] 3 ] The second kernel call then uses the scanned array to address locations to record 
pair identification numbers identified in the crude sweep. As the number of pairs is identified during the 
first kernel call memory requirements can be considered before calling the second kernel call. 

A second more rigorous sweep is done on the pairs identified from the first one. For this sweep we 
use the particle positions and velocities computed using Keplerian evolution at the beginning and end of 
the timestep. We use the third order interpolation scheme described in section 4.4 bv IChambersi ( 19991 ) to 



predict the minimum separation during the timestep which we compare to the sum of their Hill radii. Pairs 
which fail this test are marked. Pairs which approach within a factor as times times the sum of their Hill 
radii are marked as undergoing a close encounter. Using a second scan we repack the list of identified pairs 
into a smaller array. 

After all pairs undergoing encounters have been identified, we sort them into non-intersecting sublists. 
This is done on the CPU as we expect the number of pairs now identified is not large. 

We note that since we used the escape velocity in our first sweep to identify pairs of particles undergoing 
encounters, we could miss encounters between particles escaping from the system and other particles. The 
number missed we suspect would be small. 

The number of pairs identified in the first crude sweep depends on the timestep and the particle density. 
We could consider other algorithms for removing possible particle pairs from consideration as long as we 
keep in mind that the first sweep should remove as many pairs as possible while being as efficient as possible. 

2. 5. List of Parameters 

We review some parameter definitions 

1. The timestep r. As the symplectic integrator is second order the error should depend on r 3 . 

2. The Hill factor an is used to define r crit ( equation IT21 . This parameter is needed to compute the 
transition function K (equation 12. ip and so is needed by the Hermite integrator and to compute 
interactions when there are close encounters (equation [7]) • This parameter is a distance in Hill radii 
of the most massive particle involved in a close encounter. 

3. The Hill factor o,e- Particle pairs with minimum estimated approach distances within a^ times the 
the sum of their Hill radii are identified as undergoing close approaches. 

4. The smoothing length, e#, used in the Hermite integrator. As we do not yet take into account actual 
collisions, this parameter should be small but not zero. A non-zero smoothing length will prevent 
extremely small timesteps in the event of a close approach of two point masses. 

5. The parameters setting t he accuracy of the Hermite integrator 77, and rj s (as discussed and defined by 



Makino fc Aarsethlll992l) . 
6. The number np SW it c h- If the number of particles involved in an encounter is larger than this number 
then the Hermite integration is done on the GPU rather than on the CPU. 

3. Test Integrations 

The chaotic nature of the many-body problem makes it challenging to check the accuracy of any code 
designed to simulate solar or extrasolar systems. There is no simple way to generate analytical solutions to 
a set of initial conditions. However, it is possible for us to run our integrator through a suite of tests and 
compare those results t o the integ r ators that form the basis or our code, namely the the hybrid symplectic 
integrator Mercury by Cha mbers (1999) and the democratic heliocentric modification to mixed variable 



symplectic integrators iDuncan et al.l (|1998f) implemented in the SyMBA package. 



'http://gpgpu.org/developer/cudpp 



One of the most basic tests we ran on the integrator was to check that smaller timesteps ensured greater 
energy conservation for a given set of initial conditions. We use the relative energy error as a metric 
for measuring the conservation of energy in a simulation. This relative energy error is computed with 
the formula AE = (E — Eq)/Eq where Eq is the energy at the beginning of the computation. Indeed, 
we do observe superior energy conservation with smaller timestep sizes with the error scaling 0(t 3 ) as 
expected as the integrator is second order. Two simple test integrations of 1024 particles and identical 
initial conditions with timesteps of 0.1 and 0.01 were completed. Over 10 timesteps of 0.1 the relative 
energy error was AE = —8.429 x 10~ 6 with a per step average energy error of AE = —8.429 x 10~ 7 . Over 
100 timesteps of 0.01 the relative energy error was AE — —8.422 x 10~ 8 with a per step average energy 
error of AE = —8.422 x 10~ 10 . Comparing the average energy error per step for these two test simulations 
indicates an exact scaling with r 3 . 



3.1. Enhanced or Scaled Outer Solar System 

Our primary test integration is an integration of a scaled version of the outer sola r system. Both 
SyMBA and Mercury were tested in this manner ( Duncan et all Il998t IChambersl . Il999l ). The simulation 
consists of the four giant planets in the Solar system but w ith masses increased by a factor of 50 . Previous 
simulations demonstrate that this configuration is unstable (JDuncan. M.J.. fc Lissauer. J.J.I . I1997T) . although 
th ey disagree on th e eventual outcome due to the chaotic nature of solar system evolution. As noted 
by IChambersl (J1999I ). the timestep chosen can have an effect on the eventual outcome even when varied 
only slightly. With these facts in mind, we ran a simulation with values as close as possible to those of 
previous tests. To match the simulation described in section 5.1 bv IChambersl ( 19991 ) we used a timestep 
of r = 0.00255Po where Pq is the initial orbital period of the innermost planet. We set the Hill factor 
oe = 2.5 for encounter detection and an = 1-0 setting the tran sition radius, r^ it- For initial conditions we 



used epoch J2000.0 orbital elements for the four giant planets (jStandish et all 119921 ). The energy error for 



this simulation is shown in Figure Q] Time is given in o rbital periods of t he innermost planet. We ran our 
simulation for the same number of orbital periods as did IChambersl ( 19991 ) in the test shown in their Figure 
2. 

Despit e the highly chaot ic and unstable nature of this system, our integration is remarkably quite similar 
to that bv lChambersI (J1999I ) (see their Figure 2). The energy error is bounded, as expected for a symplectic 
integ rator. The spike s in the energy error are also evident with other integrators (see Figure 2 bv lChambers 
19991 and Figure 6 by iDuncan et al.lll998l ) . A close encounter between Jupiter and Saturn is experienced at 
approximately 200 years into th e integration, whi ch causes a jump in the relative energy error. A similar 
jump in energy was also seen bv IChambersl (J1999T1 . Our simulation also involved the later ejection of more 
than one planet. 

The energy error of our integration is bounded - typical of a fixed timestep symplectic in tegrator. However 
the sizes of the individual spikes in energy error is larger th an those shown in Fig ure 2 bv IChambersl (1999) 
th ough it is s i milar in size to those shown in Figure 6 by IDuncan et al 



The spikes in Figure 2 



bv IChambersl ( 19991 ) are of order 10 -6 in fraction error whereas ours are of order 10 -5 . There are several 
possible explanations for the worse performance of our inte gration. During c lose approaches we are using 
a Hermite integrator rather than the Burlisch-Stoer used bv IChambersl ( 1999T ). However we have measured 
the error across each close approach and find conservation of H' Int at a level orders of magnitude below 
10~ 5 so the choice of integrator for close approaches is unlikely to be the cause. We have checked our drift 
and Keplerian evolution operators and find they conserve their Hamiltonians within the precision of double 
precision arithmetic. We find little dependence on energy error in the form of the transition function, or 
the Hill factor an setting r cr i t . However the Hill factor influencing the identification of close encounters, 
oe, does affect the energy error. During encounters, terms in the interaction Hamiltonian (equation 2]) can 
become large. However we add these into the interaction term during the interaction evolution. These terms 
are then removed subsequently once encounters are identified so that H' int is calculated (see the discussion 
in section 2.1 and steps 3 and 7 in section 2.3). This procedure for removing the incorrectly calculated 
interaction terms could account for the somewhat poorer performance of our integrator. 
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Figure 1: We show the relative energy error {E — Eq)/Eq from an integration of the outer solar system (Jupiter, Saturn, Uranus, 
Neptune) only with masses enhanced by a factor of 50. Time is in units of the orbital period in years of the innermost planet. 
The energy errors are bounded. This behavior is typical of fixed timestep symplectic integrators. There is a planet/planet 
encounter at a time of about 200 years. Eventually planets are ejected. 



3.2. Long term outer Solar system evolution 

For our second integration we compare the relative energy error of the evolution of the outer Solar system 
wi th much larger timest eps and for a much larger amount of time - similar to the test integration discussed 
bv iDuncan et al.l ( 19981) and shown in their Figure 2. In this test, we again use the epoch J2000.0 orbital 
elements as initial conditions but use a timestep of r = 0.0318Po where Pq is the initial orbital period of the 
innerm ost planet. The times tep used and the total integration time (~3x 10 5 yr) are similar to those values 



used in IDuncan et al.l ([19981 ). The masses of the planets are unchanged from their accepted values. This 
configuration is known to be stable so we did not expect any close encounters or ejections and a somewhat 
better relative energy error compared to the last simulation despi te the increase in tim estep size. As in the 
previous test, we did not observe a relative energy error as good as IDuncan et al.l ( 19981) : our accuracy being 



a factor of a few poorer in terms of both the average energy error as well as the size of the fluctuations in 
the energy error. No encounters are present in this simulation suggesting that we have somewhat larger 
sources of errors in our interaction or Keplerian evolution steps than SyMBA. We are not yet sure what is 
causing this low level of error as these computations have been done in double precision and when tested 
individually for single timesteps, we have found them accurate. 

3.3. Sensitivity of Parameters 

We find that the energy error is insensitive to the Hill factor an setting the transition function but is 
quite sensitive to a^, the Hill factor for encounter detection. We find the best numerical results for a& in 
the range 1 to 4. The larger the value of aE the more encounters are sent to the Hermite integrator and the 
slower the integration. However if oe is too small then the difference between the integrated Hamiltonian 
and true one will be large as the interaction terms become large. 
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Figure 2: We show the relative energy error in a simulation of the outer solar system (Jupiter, Saturn, Uranus, Neptune). 
Initial orbital elements are those of the giant planets at epoch J2000.0. Time is given in units of the orbital period in years of 
the innermost planet. 

4. Benchmarks and Profiling 

In this section we discuss the fraction of runtime spent doing each computation in some sample integra- 
tions. Computations that are run on the GPU are called kernels. In Tables Q] and [5] we list the fraction 
of GPU time spent in each kernel or group of kernels or doing memory operations for these two different 
simulations. We also list the CPU runtime for each kernel which includes the overhead for calling the de- 
vice function in addition to the runtime on the GPU. We label the kernels or groups of kernels as follows: 
Interaction (evolving using H int ), Keplerian (evolving using Hkcp), Sweeps (the kernels for finding close 
encounters), Drift (evolving using Hr>ft), Energy (evaluating the total energy on the GPU and only done 
once per data output), and Hcrmite (when the Hermite integrator is run on the GPU). Also listed is the 
total time spent doing memory allocation and transfers (listed under 'Memory'). All other kernels are listed 
under 'Various' and the computation times are summed. Kernels in the 'Various' category include center of 
momentum calculations, scans, repacking, etc that are not part of the previous listed kernels. Three sweeps 
are called, the first two passing over all particle pairs. The sum of the time spent in all sweeps is shown in 
the table under Sweeps. 

The simulations described in Table 1 and 2 are identical except in the number of particles in the sim- 
ulation. The initial conditions consist of 4 massive planets inside a debris disk which is significantly less 
massive than the planets and is truncated relatively quickly. Both simulations were run for 1 output of 
100 timesteps. We note that the energy kernel is only called once per output. The set of initial conditions 
was chosen so that the system would quickly have close encounters involving two planets and so force the 
integrator to call the Hermite integrator with all particles. This allowed us to measure the performance 
of different kernels. Relevant profiler information includes the number of kernel calls made of a particular 
function, total time to complete all calls of that function on both the CPU and GPU and the percentage of 
GPU runtime spent running each kernels. Tables 1 and 2 represent a scenario in which collision are frequent 
and the Hermite integrator is called frequently. 

Profiling for the code was done with the NVIDIA CUDA Visual profiler version 3.0 that is available with 
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the CUDA toolkit. Profiling was done on two video cards, the NVIDIA GeForce GTX 285 and the GeForce 
GTX 480. Tables 1-2 only show the times for computations on the GPU alone and CPU overhead + GPU 
time to call these functions, but do not show the fraction of total computation time on the GPU. However, 
using the GPU Utilization Plot available in the Visual profiler, we are able to determine the session level (an 
entire integration) GPU utilization. For 10 3 particles we achieved a utilization of 93% and for 10 4 particles 
a utilization of 85%. For 10 5 particles, utilization varied dramatically, but never dipped below ^50%. The 
overall utilization dropping for higher number of particles may seem counterintuitive, but makes sense in 
light of the increasing number of particles that are flagged for close approaches and sent to the CPU for 
integration. In fact, in certain cases, particularly with low densities or small hill factor identification radii, 
10 5 particle simulations would display very high utilization. It is important to note that these utilization 
percentages represent the time that the GPU is not idle - it is not indicative of the actual performance 
of a particular kernel or the code as a whole. There are other more useful metrics for individual kernel 
performance. Also, we note that the Hermite integrator is the only major routine in the code that is run on 
the CPU and because of this the host processor speed and quantity of main system memory have little effect 
on the values given in the tables below. The effect of the higher CPU speeds is to decrease the total runtime 
and to effect the GPU utilization values. Only in simulations with lower GPU utilization percentages do 
the effects of increased processor speed become apparent. Altering main system memory amounts or speeds 
has no practical effect on our runtime due to the relatively small amount of memory being used - even for 
10 5 particle simulations. 

We observed that nearly all of the runtime is dominated by the interaction step, the Hermite integrations 
and the close approach detection kernels labeled "Sweeps" in the table. Memory operations, the drift and 
Keplerian evolution steps of the symplectic integrator and all of the other various functions on the GPU add 
up to only a small fraction of the runtime. This percentage will continue to decrease as the particle number 
is increased as the interaction step is 0(N 2 ) but the Keplerian evolution and drift step are O(N). The ratio 
of time spent doing sweeps compared to that in the interaction step may increase with N as there may be 
more encounters when the particle density is higher. Comparing Table [1] and [2] we see that the fraction 
of time spent in the interaction step is only somewhat larger when the number of particles is larger. This 
implies that the fraction of time spent doing operations that are O(N) is small even when the number of 
particles is only 1024. 

Examining the timing information from the simulations with 10240 particles we note that on average a 
single call to the Hermite and interaction kernels took about 0.20 and 0.11 seconds respectively. We also 
observe that the CPU + GPU runtime are nearly identical to the GPU runtime alone for all major kernels. 
It is only with smaller numbers of particles or kernels that are O(N) that these values diverge even slightly. 

In Table|3]we show a comparison of simulations with three different particle numbers and identical initial 
conditions as those given for the first two sets of simulations described in the first two tables. However, 
these were evolved for only 10 timesteps instead of 100 before outputting data. These integrations lack 
close encounters and so serve to compare sweep, energy and interaction kernels. For simulations with sparse 
debris disks or low mass objects, this third table describes how the GPU runtime will be spent. Clearly, 
the interaction step, energy calculation and sweeps compose a majority of the runtime. As larger number of 
timesteps are taken per data out or if we suppress the calculation of the energy, we will reach an asymptotic 
limit of 85% GPU time spent on interaction step with the remaining 15% of the GPU time spent on the 
sweeps for encounter detection] 4 ] 

4-1- Optimizations 

We have optimized our code beyond the standard CPU software optimization techniques by using some 
of the "best practices" for GPU programming. Refer to the "Best Practices Guide - CUDA 3.0" for an in 
depth description of low, medium and high priority optimizations. The most important and most obvious 
best practice is to run as much of your code on the GPU as possible while simultaneously implementing 
each kernel to take advantage of as much parallelism as possible. To this end, we have put nearly all of 



4 GPU time in % listed by the CUDA profiler does not always add up to 100 for reasons of significant figures. 
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Table i: Profile for 1024 particles 



Function/Kernel No. of calls GPU Timers) CPU Timc(^s) GPU time (%) 



Interaction 


200 


790202 


792607 


51.40 


Hermite 


80 


537281 


539429 


34.95 


Sweeps 


300 


147786 


151312 


9.59 


Mem. Trans. 


2566 


22567 


41226 


1.46 


Keplerian 


100 


17116 


18308 


1.11 


Energy 


3 


11227 


11287 


0.73 


Drift 


200 


1770 


4007 


0.11 


Various 


1521 


9217 


27259 


0.57 



This table shows the fraction of time spent in different tasks for an integration of 1024 particles integrated 
for 100 timesteps on an NVIDIA GTX 285. The leftmost column lists the Kernels. 'Mem. Trans.' denotes 
the time involved in memory transfers. 



Table 2: Profile for 10240 particles 



Function/Kernel No. of calls GPU Timers) CPU Timers) GPU time (%) 

53^21" 

37.20 

8.30 

0.70 

0.39 

0.11 

0.01 

0.02 

Similar to Table [TJ This profile is for 10240 particles integrated for 100 timesteps on an NVIDIA GTX 285. 



Interaction 


200 


Hermite 


80 


Sweeps 


300 


Energy 


3 


Mem. Trans. 


2566 


Keplerian 


100 


Drift 


200 


Various 


1721 



2.29 x 10 7 


2.29 x 10 7 


1.60 x 10 7 


1.60 x 10 7 


3.58 x 10 6 


3.59 x 10 6 


302848 


302920 


171014 


316417 


48186 


49397 


8183 


10427 


15529 


35350 



Table 3: Comparing Kernel speeds for 10 3 , 10 4 and 10 5 particles 



Kernel 


No. Particles 


No. of calls 


Time /is 


GPU time (%) 




102400 




1.965 x 10 s 


77.26 


Interaction 


10240 


20 


2.286 x 10 6 


76.91 




1024 




79019 


71.84 




102400 




3.071 x 10 Y 


12.06 


Sweeps 


10240 


30 


356295 


11.98 




1024 




14776 


13.41 




102400 




2.689 x 10 7 


10.59 


Energy 


10240 


3 


307708 


10.35 




1024 




11210 


10.19 




102400 


507 


234894 


0.08 


Various 


10240 


477 


21849 


0.70 




1024 


457 


4986 


4.51 



This profile shows something akin to the asymptotic limit of simulations which do not have objects expe- 
riencing close approaches. Each simulation is of a single output of 10 timesteps on a 285 GTX with the 
specified number of particles. 
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the computations for our integrator on the GPU including all the evolution operators in our symplectic 
integrator (drift, Keplerian, interaction), all collision detection sweeps, the Hermite integration routine as 
well as the energy computation. As shown in the profiling, the execution time of a non-interacting system is 
dominated by the 0(N 2 ) interaction term; a kernel with a high degree of arithmetic intensity and minimal 
memory transfer compared to the number of floating point operations. Some of these routines show small 
GPU performance benefits over the CPU because of their large size in terms of registers, lack of arithmetic 
intensity in that they are O(N), and their inability to use shared memory. However, even these serial 
routines are executed on the GPU to ensure as few host to GPU or GPU to host memory transfers. This 
best practice is very important because it ensures fewer high latency memory transfers between the system 
memory and global memory on the GPU and again from global memory to the multiprocessors. The more 
time that can be spent doing floating point operations rather than memory transfer allows for the greatest 
GPU speed increases. Other high priority optimizations are to access shared memory over global memory 
whenever possible and to keep your kernels from having diverging execution paths. Some routines were not 
or could not be written to use shared memory, but all 0(N 2 ) sweeps and the interaction step leverage this 
tremendous performance enhancer. All kernels are written to minimize branching statements to ensure that 
execution paths do not diverge. This best practice manifests itself most obviously in the interaction detection 
and Hermite integration kernels. Collision detection was not incorporated in the same function, but rather 
in a separate kernel. In another example of code design choice, we over count and do N 2 computations in 
the interaction step instead of N(N — l)/2 since it allows for us to have simpler code that is more easily 
executed in parallel on the GPU. 

Medium priority optimizations including use of the fast math library that could potentially effect the 
accuracy of our code in a negative way are not implemented. We do use multiples of 32 threads for each 
block and we are able to attain a 33% occupancy rate for the interaction step, the energy computation and 
the GPU version of the Hermite gravity step as well as a 50% occupancy rate for the first and second sweep 
steps on a GF100 based GTX 480. The concept of occupancy is a complicated one and is explained in more 
detail in both the "Best Practices Guide - CUDA 3.0" guide and the "CUDA Programming Guide 3.0". 
At it's core it's simply a ratio of the number of active warps in a multiprocessor to the maximum number 
of possible warps the multiprocessor can maintain. A warp is simply a group of 32 threads to be executed 
on a multiprocessor. While computing the number of active warps per multiprocessor involves hardware 
knowledge beyond this document, and noting that higher occupancy does not necessarily equate to better 
performance, it is important to maintain a minimum occupancy. Below some value, the latencies involved 
with launching warps can not be hidden. The value suggested in the "Best Practices Guide - CUDA 3.0" 
suggest an occupancy of at least 25% be maintained. As mentioned, we are able to attain this. 

We also attempt to ensure optimal usage of register space per block and to loop unroll functions when it 
can increase performance. The number of loop unrolls that can be achieved is determined by examining the 
numbers of registers used per thread for a given function and multiplying it by the number of threads per 
block that you wish to issue. This gives the total number of registers used per multiprocessor - the group of 
CUDA cores that a block is issued to. This value must be smaller than the number of registers available on 
the relevant architectures multiprocessor. Increasing the number of loop unrolls in a given kernel can alter 
the register requirements, so there is a limit to the number of loop unrolls that can be implemented. For 
reference, the GT200 architecture (GTX 285) and the GF100 architecture (GTX 480), there are 8 and 32 
CUDA cores per multiprocessor, 16K and 32K 32-bit registers, and 16KB and 48KB per multiprocessor of 
shared memory respectivelyjj All relevant hardware numbers including number of threads, multiprocessors, 
cores per multiprocessor, amount of shared memory, etc. can be found in the CUDA Programming Guide in 
appendices A and G. It is possible to artificially restrict a function to use a number of registers that is less 
than it requires. However, this forces some values to be stored in local memory (global memory), which, as 
mentioned previously, carries a very heavy performance penalty in terms of latency. For this reason, it can 
be detrimental to limit the number of registers per thread below that which is required. We found that our 
performance was best with a loop unroll of 4, a maximum number of registers per thread set to 64, and 128 
threads per block. 



5 Doublc precision values are 64-bit. requiring 2 registers per value. 
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4-2. Use of Parallel Primitives 

The parallel computations used by this code for the most part utilize parallel primitives and so can be 
ported to other parallel computation platforms with similar parallel primitive libraries. The energy compu- 
tation, interaction step and all pairs sweep id entific ation routi nes essentially utilize the same tiled shared 
memory algorithm (descr ibed by Nvland. Harris, fc Prinsll2008l The drift step and encounter detection use 



a parallel prefix sum (see lHarris et al.ll200 



4-3. Areas for future optimization 

There are several areas in which our code could be further optimized. Without drastically altering the 
code there are some "low priority" optimizations such as the use of constant memory for unchanging values 
like smoothing lengths and the Hill factors. 

More important hardware level optimizations could include alternate ways to ensure global memory 
coalescing and prevent shared memory bank conflicts by padding our current data structures or disassembling 
them entirely and using single arrays of double precision elements to store data. 

On a higher level, there are several optimizations that could possible speed up the code significantly. 
First, replacing the i nteraction step which is currently an all-pairs 0(N 2 ) calculation with a GPU enhanced 
tree integration (e.g.. lRichardson et al.ll2000t iGaburov et al.ll2010l) could bring a speed increase to the code 



and allow larger numbers of particles to be simulated. 

Additionally, the sorting routines ('sweep kernels') could be optimized to use faster detection methods 
than the current 0(N 2 ) methods. By implementing a parallel sorting algorithm or simply making our 
detection routine more intelligent, we could reduce this part of the runtime. For simulation of many massless 
but colliding particles (such as dust particles) in the vicinity of planets and planetesimals an improvement 
in the encounter identification may allow us to integrate many more particles. 

A potentially simpler change would be the implementation of double precision in software rather than 
hardware. Double precision code on NVIDIA GPU's executes more slowly than single precision, as is often 
the case in hardware. Depending on the device, this factor can be anywhere from l/8th to 1/2 the speed 
and some older CUDA capable devices do not support double precision at all. By implementing double 
precision in software we would be able to compile our code using 32-bit floating point precision. This would 
have the benefit of multiplying the speed of the code you are running by a factor of at least a few and 
could possibly allow devices not originally capable of supporting the code to run. Additionally, it appears 
that only GF100 devices in the Tesla brand of cards will support full double precision speeds which run 
at 1/2 the number of double precision flops. A penalty of I/8th the number of single precision flops was 
observed while profiling our code on a GTX 480. This was observed indirectly by only noticing a speed 
increase of 2x on our GFI00 based machine for our double precision kernels like the interaction step and 
the energy computation over our GT200 based GPUs. At I/8th the peak double precision flops, the GFI00 
based GPU should be 2x faster than a GT200 based GPU because it has 2x the number of cores over the 
GT200 based cards (ignoring minor execution time differences based on the actual core clock speeds). The 
GT200 based GPUs are known to have a double precision flops peak l/8th that of the single precision peak. 
In practice, it should be noted that the GFI00 cards run faster than 2x that of GT200 GPUs due to other 
architectural optimizations between the cards. In particular, small kernels which have not been or do not 
parallelize as well run much faster on the GFI00 GPUs. Lastly, in addition to our code supporting more 
devices and running 2x the speed, the code would be easier to optimize to ensure global memory coalescing 
and to prevent shared memory bank conflicts due to the 4 byte size of single precision floating point. 

However, software implementation of double precision has downsides. Notably, it is often difficult to 
achieve the same precision as double precision floating point when using two single precision floating point 
values. Additionally, full conversion of the code to use single precision floating point values and redefining 
basic vector operations would involve extensive reworking of the kernels that could lead to dozens of extra 
operations in each particular kernel. It is not obvious that this sort of optimization is entirely suited to 
the GPU because increasing the number of operations in a kernel can increase register usage - an effect 
we are trying to avoid for optimization reasons. Finally, simply modifying the distance calculation to be in 
double precision and computing the rest of the acceleration step in single precision would result in the largest 
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performance benefit with the least amount of coding required but could potentially lose a large amount of 
precision in our calculations. 

Due to the previously mentioned register restrictions and the complexity of our code, it is often difficult 
to achieve high occupancy on the devices we use. It may be possible to rewrite entire routines in non-obvious 
ways to reduce the number of maximum registers in use at a given point in time to remedy this problem. 
This is only a " medium priority" optimization and is probably the most difficult due to the number of ways 
kernels can be re-written. This optimization is also hardware dependent and so may not be a worthwhile 
optimization. 

5. Summary and Conclusion 

We have described an implementation of a hybrid second order symplectic int egrator for planetar 



system integr ation that permi ts close approaches. It is similar in design to SyMBA (Du ncan et all 1199 



and Mercury ( ChambersLll999l ) but is written in CUDA and works in parallel on a GPU. The code is almost 



as accurate as the older integrators but is faster when many particles are simultaneously integrated. Bounded 
energy errors are observed during numerical integration of a few test cases implying that our integrator is 
indeed nearly symplectic. The code has been written primarily with parallel primitives so that modified 
versions can be written for other parallel computation platforms. The current version of the code does not 
take into account collisions between particles, however we plan to modify future versions so that these can 
be incorporated into the code. We also plan to modify the encounter algorithms so that many dust particles 
can be integrated in the vicinity of planets and planetesimals. 

Support for this work was provided by NSF through award AST-0907841. We thank Richard Edgar for 
the design and initial set up of our GPU cluster. We thank NVIDIA for the gift of four Quadro FX 5800 
and two GeForce GTX 280 video cards. We thank Nicholas Moore for informative discussions regarding 
GPU architectural and coding minutiae. 

Appendix A. Appendix: f and g functions 

Given a particle with position Xo and velocity vo at time to in Keplerian orbit, its new position, x, and 
velocity, v, at time t can be computed 

x = /x + gv 

v = /x +.gv (A.l) 

in terms of the / and g functions and their time derivatives, / and g. Introductory celestial mechanics 
textbooks often discuss / and g functions for particles solely in elliptic orbits. However if a particle in a 
hyperbolic orbit is advanced using elliptic coordinates, a NaN will be computed that can propagate via the 
interaction steps. It is desirable to advance particle positions for all possible orbits, including parabolic 
or hyperboli c ones. This can be done by computing the / and g functions with universal variables, as 
described by IPrussing fc Conwavl ( 19931 ). The recipe is repeated here as it is useful, but not available in 
most textbooks. 

With ±i = \/GM and a = 1/a where a is the semi-major axis and ro the initial radius, the / and g 
functions and their time derivatives in universal variables are computed as 

/ = 1--C(ax 2 ) 
ro 



x 



= (t-to)--=S( 



„2\ 



ax 



ax 6 S{ax*) - ll 



x vM r ,„3c^j 



rr 

1 - — C(ax 2 ) (A.2) 

r 
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(see equations 2.38 bv lPrussing fc Conwavlll993l ). Here the variable x solves the universal differential Kepler 
equation 

(A.3) 



y/JI(t — to) = — C{ax 2 ) + (1 — roa)x 3 S(ax 2 ) + rpx, 



V& 



(equation 2.39 lPrussing fe Conwavll 19931 ). The / and g functions must be computed before the time deriva- 
tives, / and /, so that r, the radius at time t, can be computed. This radius is then used to compute / and 



The needed transcendental functions are 



C{y) 



A. _ M. _l V— 

2! 4! ^ 6! 
1-cos ^/y 



V 
cosh V 



-S/-1 



and 



S(y) 



1 _ m. j- V- 

3! 5! ~r 7! 
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y 
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rv 
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i) 


> 





if 


ii 


< 






(A.4) 



(equations 2.40 bv lPrussing fc Conwavlll993l) . 

It is convenient to define the function F(x) and compute its derivatives 



F(x) = 
F'(x) = 
F"(x) = 



(r • v )x 



(r • v )a; ; 



C{ax 2 ) + (1 — roa)x 3 S(ax 2 ) + r^x 



(rp • vp) 



[l - o^S^ax 2 )] + (1 - r a)x 2 C(ax 2 ) + r 
1 - ax 2 C(ax 2 )] + (1 - r a)x [l - ax 2 5(aa; 2 )] 



(A.5) 



(equation 2.41, 2.42 and problem 2.17 bv lPrussing fc Conwavlll993h . 

To solve the universal Kepler equation ( equation IA. 31 that can now be written as F(x) — 0) the Laguerre 
algorithm can be computed iteratively (|Conwavi 119861 ) as 



Xi+l 



nF{xi) 



F'(xi) ±\(n- l) 2 F'(x t ) 2 - n(n - l)F(x l )F"(x l )\ 1/2 



(A.6) 



(equa tion 2.43 IPrussing fc Conwavlll993l) . Good numerical performance is found with n = 5 (jConwav , 
1986). The sign in the denominator is the same as th e sign of F'(xj ). The rate of convergence is cubic 



and convergence is achieved for any starting value of x (jConwavL |1986). Using a starting value xq — r$ we 
achieved convergence for a wide range of timesteps and orbital pa rameters to a level of 10~ 16 in under 6 
i terati ons. Improved choices for starting values of a; are discussed bv lPrussing fc Conwavl (|l993l ) and lConwav 
(|1986I ) but require more computation than xq — rp. 
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